8 research outputs found

    TopCom: Index for Shortest Distance Query in Directed Graph

    Finding the shortest distance between two vertices in a graph is an important problem with numerous applications in diverse domains, including geo-spatial databases, social network analysis, and information retrieval. Classical algorithms (such as Dijkstra's) solve this problem in polynomial time, but they cannot provide real-time responses to a large number of bursty queries on a large graph. Consequently, indexing-based solutions that pre-process the graph to answer a large number of distance queries efficiently (exactly or approximately) in real time are becoming increasingly popular. Existing solutions vary in index size, index building time, query time, and accuracy. In this work, we propose TopCom, a novel indexing-based solution for exactly answering distance queries. Our experiments with two existing state-of-the-art methods (IS-Label and TreeMap) show the superiority of TopCom over both in scalability and query time. In addition, TopCom's indexing exploits the DAG (directed acyclic graph) structure of the graph, which makes it significantly faster than the existing methods if the SCCs (strongly connected components) of the input graph are relatively small.
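
The abstract above refers to the DAG structure obtained from the strongly connected components of a directed graph. As a rough illustration only (this is not the TopCom index itself), the following sketch condenses a directed graph into its DAG of SCCs using Kosaraju's algorithm. The function names and the adjacency-dictionary input format are assumptions made for this example.

```python
def strongly_connected_components(graph):
    """Map each node to an SCC id (Kosaraju's algorithm, iterative).

    `graph` is a dict: node -> iterable of successor nodes (assumed format).
    """
    nodes = set(graph)
    for succs in graph.values():
        nodes.update(succs)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing finish order.
    reverse = {n: [] for n in nodes}
    for u, succs in graph.items():
        for v in succs:
            reverse[v].append(u)

    component, next_id = {}, 0
    for start in reversed(order):
        if start in component:
            continue
        component[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in component:
                    component[prev] = next_id
                    stack.append(prev)
        next_id += 1
    return component


def condense(graph):
    """Return (node -> SCC id, DAG over SCC ids as dict of sets)."""
    component = strongly_connected_components(graph)
    dag = {c: set() for c in set(component.values())}
    for u, succs in graph.items():
        for v in succs:
            if component[u] != component[v]:
                dag[component[u]].add(component[v])
    return component, dag


# Two cycles, {1, 2, 3} and {4, 5}, joined by the edge 3 -> 4.
example = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
component, dag = condense(example)
```

When the SCCs are small, the condensed graph stays close to the original in size but is acyclic, which is the setting the abstract describes as favorable.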

    Neural‑Brane: Neural Bayesian Personalized Ranking for Attributed Network Embedding

    Network embedding methodologies, which learn a distributed vector representation for each vertex in a network, have attracted considerable interest in recent years. Existing works have demonstrated that vertex representations learned through an embedding method provide superior performance in many real-world applications, such as node classification, link prediction, and community detection. However, most existing methods for network embedding utilize only the topological information of a vertex, ignoring the rich set of nodal attributes (such as user profiles in an online social network, or textual content in a citation network) that is abundant in real-life networks. A joint network embedding that takes into account both attributional and relational information captures the complete network information and could further enrich the learned vector representations. In this work, we present Neural-Brane, a novel Neural Bayesian Personalized Ranking based Attributed Network Embedding. For a given network, Neural-Brane extracts latent feature representations of its vertices using a purpose-designed neural network model that unifies network topological information and nodal attributes. In addition, it utilizes a Bayesian personalized ranking objective, which exploits the proximity ordering between a similar node pair and a dissimilar node pair. We evaluate the quality of the vertex embeddings produced by Neural-Brane on node classification and clustering tasks over four real-world datasets. Experimental results demonstrate the superiority of our proposed method over existing state-of-the-art methods.
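
To make the Bayesian personalized ranking objective mentioned above concrete, here is a minimal sketch of the standard BPR loss for a single triplet: a vertex `u`, a similar vertex `pos`, and a dissimilar vertex `neg`. It assumes a dot-product similarity between embedding vectors; the function names are hypothetical, and the neural architecture of Neural-Brane is not reproduced.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(u, pos, neg):
    """Standard BPR loss for one triplet: -log(sigmoid(s(u,pos) - s(u,neg))).

    Assumption: similarity s is the dot product of the embeddings.
    The loss is small when `pos` scores higher than `neg` for `u`.
    """
    margin = dot(u, pos) - dot(u, neg)
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The similar node is ranked above the dissimilar one, so the loss is below log(2).
loss_good = bpr_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# The ordering is reversed, so the loss is above log(2).
loss_bad = bpr_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Minimizing this loss over many sampled triplets pushes each vertex closer to its similar vertices than to its dissimilar ones, which is the proximity ordering the abstract refers to.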

    E-CLoG: Counting edge-centric local graphlets

    In recent years, graphlet counting has emerged as an important task in topological graph analysis. However, existing works on graphlet counting obtain the graphlet counts for the entire network as a whole. These works capture the key graphical patterns that prevail in a given network, but they fail to meet the demand of the majority of real-life graph-related prediction tasks, such as link prediction and edge/node classification, which require building features for an edge (or a vertex) of a network. To meet the demand of such applications, efficient algorithms are needed for counting local graphlets within the context of an edge (or a vertex). In this work, we propose an efficient method, titled E-CLoG, for counting all 3-, 4-, and 5-size local graphlets within the context of a given edge, for all of its distinct edge orbits. We also provide a shared-memory, multi-core implementation of E-CLoG, which makes it even more scalable for very large real-world networks. In particular, we obtain strong scaling on a variety of graphs (14x-20x on 36 cores). We provide extensive experimental results to demonstrate the efficiency and effectiveness of the proposed method. For instance, we show that E-CLoG is faster than existing work by multiple orders of magnitude; for the Wordnet graph, E-CLoG counts all 3-, 4-, and 5-size local graphlets in 1.5 hours using a single thread and in only a few minutes using the parallel implementation, whereas the baseline method does not finish in more than 4 days. We also show that local graphlet counts around an edge are much better features for link prediction than well-known topological features; our experiments show that the former yield between 10% and 45% improvement in the AUC value for predicting future links in three real-life social and collaboration networks.
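
As a small illustration of what "local graphlets within the context of an edge" means, the sketch below counts only the two size-3 graphlets that contain a given edge in an undirected graph: triangles and open 3-node paths. It is an assumed, simplified example and does not cover the 4- and 5-size graphlets, the edge orbits, or the parallel algorithm described in the abstract.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from an edge list; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def size3_graphlets_around_edge(adj, u, v):
    """Count size-3 graphlets that contain the edge (u, v).

    Returns (triangles, open_paths):
      triangles  = third vertices adjacent to both u and v
      open_paths = third vertices adjacent to exactly one of u and v
    """
    if v not in adj.get(u, ()):
        raise ValueError("(u, v) is not an edge of the graph")
    common = adj[u] & adj[v]
    triangles = len(common)
    open_paths = (len(adj[u]) - 1) + (len(adj[v]) - 1) - 2 * triangles
    return triangles, open_paths


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
features_02 = size3_graphlets_around_edge(adj, 0, 2)
features_23 = size3_graphlets_around_edge(adj, 2, 3)
```

Counts of this kind, computed per edge, are the sort of feature vector that can then be fed to a link prediction model.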

    Predicting interval time for reciprocal link creation using survival analysis

    The majority of directed social networks, such as Twitter, Flickr and Google+, exhibit reciprocal altruism, a social psychology phenomenon that drives a vertex to create a reciprocal link with another vertex that has created a directed link toward it. Existing works have already predicted whether a reciprocal link will be created, a task known as “reciprocal link prediction”. However, an equally important problem is determining the interval time between the creation of the first link (also called the parasocial link) and its corresponding reciprocal link. No existing work has considered this problem, which is the focus of this paper. Predicting the reciprocal link interval time is challenging for two reasons. First, there is a lack of effective features: well-known link prediction features are designed for undirected networks and for binary classification, so they do not work well for interval time prediction. Second, the presence of ever-waiting links (i.e., parasocial links for which a reciprocal link is not formed within the observation period) makes traditional supervised regression methods unsuitable for such data. In this paper, we propose a solution for the reciprocal link interval time prediction task. We map this problem to a survival analysis task and show through extensive experiments on real-world datasets that survival analysis methods perform better than traditional regression, neural network-based models, and support vector regression for reciprocal interval time prediction.
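
The ever-waiting links described above correspond to right-censored observations in survival analysis. As a hedged illustration of how such data can be handled, the sketch below implements the classical Kaplan-Meier estimator of the survival function. The abstract does not say which survival models the paper uses, so this is a generic textbook example with assumed names and toy data.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t) = P(reciprocal link not yet created by t).

    durations[i]: time until the reciprocal link appeared, or until the end
                  of the observation period for an ever-waiting link.
    observed[i]:  True if the reciprocal link appeared, False if censored.
    Returns a list of (event_time, survival_probability) pairs.
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Links still waiting just before time t, censored ones included.
        at_risk = sum(1 for d in durations if d >= t)
        # Reciprocal links created exactly at time t.
        events = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): days until follow-back; False marks ever-waiting links.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

A plain regression would have to either drop the censored links or treat their observation time as the true interval; the estimator above keeps them in the at-risk set until they leave observation, which is why censoring-aware methods suit this task.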

    Feature Selection for Classification under Anonymity Constraint

    Over the last decade, the proliferation of online platforms and their increasing adoption by billions of users have enormously heightened users' privacy risk. In fact, security researchers have shown that sparse microdata containing information about a user's online activities, although anonymous, can still be used to disclose the identity of the user by cross-referencing the data with other data sources. To preserve user privacy, existing works have proposed several methods (k-anonymity, l-diversity, differential privacy) that ensure a dataset meant to be shared or published carries only a small identity disclosure risk. However, the majority of these methods modify the data in isolation, without considering its utility in subsequent knowledge discovery tasks, which makes the resulting datasets less informative. In this work, we consider labeled data of the kind generally used for classification, and propose two methods for feature selection with two goals: first, that the data has a small disclosure risk on the reduced feature set, and second, that the utility of the data is preserved for performing a classification task. Experimental results on various real-world datasets show that the proposed methods are effective and useful in practice.
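
To illustrate the anonymity side of the trade-off described above, the sketch below measures the k-anonymity level of a dataset restricted to a chosen feature subset: the size of the smallest group of records that share the same values on those features. It is an assumed, minimal example with hypothetical field names, and it is not either of the two feature selection methods proposed in the paper.

```python
from collections import Counter


def anonymity_level(rows, features):
    """Return the largest k for which `rows` is k-anonymous on `features`.

    rows:     list of dicts mapping feature name -> value
    features: feature names kept in the published data
    The result is the size of the smallest group of rows that agree on
    every selected feature (0 for an empty dataset).
    """
    if not rows:
        return 0
    groups = Counter(tuple(row[f] for f in features) for row in rows)
    return min(groups.values())


# Hypothetical records; the field names are illustrative only.
rows = [
    {"age": "30s", "zip": "462"},
    {"age": "30s", "zip": "462"},
    {"age": "30s", "zip": "479"},
    {"age": "40s", "zip": "479"},
    {"age": "40s", "zip": "479"},
]
k_age = anonymity_level(rows, ["age"])
k_age_zip = anonymity_level(rows, ["age", "zip"])
```

Adding a feature can only split groups further, so the anonymity level never increases as the feature set grows; a selection method therefore has to weigh each feature's classification value against the anonymity it costs.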

    Solving Time Prediction Problems in Networks Using Graphlets and Embedding Based Local Features

    Real-world networks are inherently dynamic; vertices and edges of these networks appear or disappear in a systematic pattern over time. Hence, the exact time at which network elements, such as vertices or edges, appear or disappear is important information for modeling such networks. Additionally, when we predict a future event (say, link generation) on the network, it is also important to predict the exact time of that event, because knowing the event time makes the prediction more valuable in real-life use. Unfortunately, existing works on dynamic networks do not consider the time value; they neither use the time information for modeling the network nor predict the time of a future event. In this thesis, I have solved the event time prediction variant of multiple well-known problems in social networks. For instance, link prediction is a well-known and possibly the most-studied problem in network analysis, but the existing solutions to this task only predict whether a future link will appear or not, on some occasions with a probability value. Unlike the existing works, in my thesis I have proposed machine learning solutions that answer the question “when will a link appear?” instead of answering whether a link will appear. I have also developed methods that use the time value of a link for the network modeling task, specifically for learning features of a network element (a vertex or an edge), which can subsequently be used for predicting the time value of a future event in the network. In solving the time prediction problem, I aim to predict the link creation time in both directed and undirected networks. To put it succinctly, this thesis opens up a new dimension in network analysis, where the time of the event is considered explicitly in both the modeling and the prediction.
The specific time prediction problems that I have designed are discussed below. The first problem is the Reciprocal Link Time Prediction (RLTP) problem, which is designed to predict the creation times of reciprocal links. RLTP is versatile and applies to many real-world settings. For example, RLTP can be used to predict the elapsed time between receiving an email and sending its reply, or to determine the follow-back time or friend request acceptance time in online social networks. The second problem is the Triangle Completion Time Prediction (TCTP) problem, which is designed to predict the creation time of a link that completes one or more triangles. A triangle is a prominent and basic building block of social networks, and researchers have shown that the majority of new links created in social networks complete one or more triangles in the network. Hence, a good solution to this problem can effectively improve the performance of various network analysis tasks, such as link prediction, network expansion studies, network generation models, and community structure generation. A solution to this problem also provides an ordering of future links based on their creation time, which is very useful for ranking user recommendations in different domains. Lastly, although time prediction is the main theme of my research, machine learning solutions to such prediction problems require effective and efficient feature generation schemes that can scale to large and complex networks. So, some of my published works, which are also part of this thesis, focus on building effective features for network time prediction.

    Triangle counting in large networks: a review

    Counting and enumeration of local topological structures, such as triangles, is an important task for analyzing large real‐life networks. For instance, the triangle count in a network is used to compute transitivity, an important property for understanding graph evolution over time. Triangles are also used for various other tasks on real‐life networks, including community discovery, link prediction, and spam filtering. The task of triangle counting, though simple, has gained wide attention in recent years from the data mining community. This is because most of the existing algorithms for counting triangles do not scale well to very large networks with millions (or even billions) of vertices. To circumvent this limitation, researchers have proposed triangle counting methods that approximate the count or run on distributed clusters. In this paper, we discuss the existing methods of triangle counting, ranging from sequential to parallel, single‐machine to distributed, exact to approximate, and off‐line to streaming. We also present experimental results comparing the performance of a set of approximate triangle counting methods built under a unified implementation framework. Finally, we conclude with a discussion of future work in this direction.
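
For reference, the relationship between triangle count and transitivity mentioned above can be sketched with a simple exact, sequential counter. This is a baseline illustration under assumed names, not one of the scalable or approximate methods the review surveys; it requires vertex labels that can be ordered (for example, integers).

```python
def build_adjacency(edges):
    """Undirected adjacency sets from an edge list; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_triangles(adj):
    """Exact triangle count; each triangle u < v < w is counted once."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbors larger than v close a triangle (u, v, w).
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


def transitivity(adj):
    """3 * triangles / connected triples (paths of length two)."""
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    if triples == 0:
        return 0.0
    return 3 * count_triangles(adj) / triples


# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
triangles = count_triangles(adj)
ratio = transitivity(adj)
```

The neighbor-set intersection for every edge is what makes exact counting expensive on graphs with high-degree vertices, which motivates the approximate and distributed methods discussed in the review.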